#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Joschka Schwarz
Once you’ve started learning tools for data manipulation and visualization like dplyr and ggplot2, this course gives you a chance to use them in action on a real dataset. You'll explore the historical voting of the United Nations General Assembly, including analyzing differences in voting between countries, across time, and among international issues. In the process you'll gain more practice with the dplyr and ggplot2 packages, learn about the broom package for tidying model output, and experience the kind of start-to-finish exploratory analysis common in data science.
The best way to learn data wrangling skills is to apply them to a specific case study. Here you’ll learn how to clean and filter the United Nations voting dataset using the dplyr package, and how to summarize it into smaller, interpretable units.
Theory. Coming soon …
1. The United Nations Voting Dataset
Hi, I’m Dave Robinson and I’ll be your instructor for this course. I’m a data scientist and I really enjoy using R to dive into a dataset and discover interesting things. In this course, we’re going to be using some of my favorite R packages, such as dplyr and ggplot2, to explore and draw conclusions from a real-world dataset. If you’ve used these packages before, this will be a great opportunity to practice using them in an analysis.
2. UN Voting Dataset
Let’s introduce the dataset, which contains the historical voting data from the General Assembly of the United Nations. In the General Assembly every member nation gets a vote, which makes this a great opportunity to explore the history of international relations. In our data analysis vocabulary, rows of a dataset are called “observations” and columns are called “variables”. In this dataset, each observation represents one combination of a roll call vote and a country.
3. UN Voting Dataset
The first variable, rcid, is the “roll call ID”.
4. UN Voting Dataset
describing one round of voting, such as to approve a UN resolution. The session variable indicates the year-long session in the UN’s history in which the vote was cast. Note that to keep the dataset at a reasonable size, only sessions from alternating years are included.
5. UN Voting Dataset
The vote column represents that country’s choice.
6. UN Voting Dataset
For example, 1 means a yes vote, and 9 means a country was not a member of the United Nations. The ccode column is a country code
7. UN Voting Dataset
that uniquely specifies the country.
8. Votes in dplyr
To work with this in R, we’d start by loading the dplyr package, which offers tools for manipulating data. Then we can view the votes dataset by simply typing “votes” into the R prompt. Here you can see each of the columns of the table, as well as the table’s size: 508 thousand rows. As with almost any dataset you’ll run into, you’ll need to clean this data before you can start analyzing it. Let’s review one of the most important tools for performing multiple sequential steps on data: the pipe operator.
9. The pipe operator
The pipe, typed as “percent greater than percent”, tells R to pass one object in as the first argument of the next function,
10. The pipe operator
which lets us perform multiple operations in a series. While it may seem complicated if you haven’t used it much before, you’ll quickly get comfortable with it.
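As a minimal sketch of the idea, `x %>% f(y)` is just another way of writing `f(x, y)`:

```r
library(dplyr)  # provides the %>% pipe

# These two lines are equivalent:
sort(c(4, 1, 9))
c(4, 1, 9) %>% sort()

# Pipes chain left to right, so each result feeds the next function:
c(4, 1, 9) %>% sort() %>% rev()
#> [1] 9 4 1
```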
11. dplyr verbs
The operations we’ll usually be composing are dplyr’s “verbs”- functions that perform a single, simple action on a dataset. Recall that the “filter” verb subsets observations from a dataset, to remove rows that aren’t interesting to us.
12. dplyr verbs
The “mutate” verb adds a variable or changes an existing variable. Here’s an example of each.
13. Original data
In our original dataset, the vote column has five possible values: 1 for yes, 2 for abstain, 3 for no, 8 meaning the country wasn’t present, and 9 meaning the country was not a member. We only care about the first three values: yes, abstain, and no.
14. dplyr verbs: filter
To remove the others, we pipe the dataset into the filter function. Within that filter we describe a condition: vote <= 3. The resulting data frame is smaller: it keeps only the observations where our condition was met.
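Assuming the course’s votes data frame is loaded, the step looks like:

```r
library(dplyr)

# Keep only "yes" (1), "abstain" (2), and "no" (3) votes;
# 8 (not present) and 9 (not a member) are dropped
votes %>%
  filter(vote <= 3)
```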
15. dplyr verbs: mutate
You’ll also be using the mutate function. The session variable is hard to interpret, but if you know the first session of the United Nations was held in 1946, you can use it to get the year each vote was cast, which is much more interpretable. To do this you could pipe the data into the “mutate” function, where you can define your new “year” column as 1945 + the session. Notice the new “year” column in the result. In your exercises, you’ll also clean up the country column to include full country names instead of IDs.
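A sketch of the cleaning step described above, again assuming the votes data frame is loaded:

```r
library(dplyr)

votes %>%
  filter(vote <= 3) %>%            # keep yes/abstain/no votes
  mutate(year = session + 1945)    # the first session was held in 1946
```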
16. Chaining operations in data cleaning
The pipe operator lets you chain these simple actions together in a sequence. You’ll get into the habit of piping many small, simple operations together to perform a richer analysis.
17. Let’s practice!
The vote column in the dataset has a number that represents that country’s vote:

1 = Yes
2 = Abstain
3 = No
8 = Not present
9 = Not a member
One step of data cleaning is removing observations (rows) that you’re not interested in. In this case, you want to remove “Not present” and “Not a member”.
Steps
- Load the dplyr package.
- Print the votes table.
The next step of data cleaning is manipulating your variables (columns) to make them more informative.
In this case, you have a session column that is hard to interpret intuitively. But since the UN started voting in 1946, and holds one session per year, you can get the year of a UN resolution by adding 1945 to the session number.
Steps
- Use mutate() to add a year column by adding 1945 to the session column.

The country codes in the ccode column are what’s called Correlates of War codes. This isn’t ideal for an analysis, since you’d like to work with recognizable country names.
You can use the countrycode package to translate. For example:
# Translate a single country code
countrycode(2, "cown", "country.name")
#> [1] "United States"
# Translate multiple country codes
countrycode(c(2, 20, 40), "cown", "country.name")
#> [1] "United States" "Canada"        "Cuba"
Created on 2022-02-27 by the reprex package (v2.0.1)
Steps
- Load the countrycode package.
- Add a country column in your mutate() statement containing country names, using the countrycode() function to translate from the ccode column. Save the result to votes_processed.

# 1. Load the countrycode package
library(countrycode)
# 2. Add a country column within the mutate: votes_processed
votes_processed <- votes %>%
  filter(vote <= 3) %>%
  mutate(year = session + 1945,
         country = countrycode(ccode, "cown", "country.name"))
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 260
Theory. Coming soon …
1. Grouping and summarizing
In your last exercises you cleaned up the raw data to create a processed set of votes,
2. Processed votes
which looked like this. Now we can start trying to pull real insights out of the data. There are far too many observations in this dataset to extract anything interpretable by looking through it manually, so we’ll need to choose a way to summarize it that’s interesting to us. Here I’ll propose a simple metric we’ll be using a lot in this course:
3. Using “% of Yes votes” as a summary
“percentage of yes votes.” If a country votes yes on most resolutions, we might infer that it tends to agree with the international consensus, while if it votes no we could assume that it tends to go against it.
4. dplyr verb: summarize
To calculate this you’ll use another dplyr verb: summarize. Summarize takes many rows and turns them into one, calculating overall metrics such as an average or total.
5. dplyr verbs: summarize
For example, we can pipe the votes_processed data into a summarize operation, telling it to create a new variable called total. n is a special function within a summarize that means “the number of rows.” The result is a one-row data frame telling us the total number of rows - 353 thousand.
6. dplyr verbs: summarize
We can add another variable to this summary: our “percentage yes” variable. Since 1 is “yes” in our dataset, we want the percentage of the rows where the vote variable is equal to 1. The way to calculate this in R is mean(vote == 1). (If you’d like to know why: R first compares each vote to 1 to get TRUE or FALSE, then treats TRUEs as ones and FALSEs as zeroes, so the mean is the fraction of TRUEs.) By calculating this, we see that about 79.9 percent of United Nations votes in history were “yes” votes. This overall summary isn’t much information; we may want to know whether this percentage has changed over time.
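Putting both summary columns together, a sketch assuming votes_processed exists:

```r
library(dplyr)

votes_processed %>%
  summarize(total = n(),                    # number of rows (votes)
            percent_yes = mean(vote == 1))  # fraction of "yes" votes
```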
7. dplyr verb: group_by
So we introduce another verb: group_by. When used before a summarize operation, this tells summarize to create one row for each sub-group, instead of one row overall.
8. dplyr verbs: group_by
For example, here we perform the same summary, but first group by year before summarizing. Now instead of getting one row overall, we get one row for each year: we see that 56.9% of votes in 1947 were yes, but only 43.8% in 1949. In later lessons you’ll use this to visualize the trend in the percentage over time. Summarizing by subgroups is a powerful way to turn large datasets into smaller ones that you can interpret. In your exercises, you’ll try grouping by country instead of year, which shows you which countries are more prone to voting “yes” or “no”.
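The same summary, grouped by year (again assuming votes_processed exists):

```r
library(dplyr)

votes_processed %>%
  group_by(year) %>%                        # one output row per year
  summarize(total = n(),
            percent_yes = mean(vote == 1))
```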
In this analysis, you’re going to focus on “% of votes that are yes” as a metric for the “agreeableness” of countries.
You’ll start by finding this summary for the entire dataset: the fraction of all votes in their history that were “yes”. Note that within your call to summarize(), you can use n() to find the total number of votes and mean(vote == 1) to find the fraction of “yes” votes.
Steps
- Print the votes_processed dataset that you created in the previous exercise.
- Summarize the dataset using the summarize() function to create two columns:
  - total: the number of votes
  - percent_yes: the percentage of “yes” votes

The summarize() function is especially useful because it can be used within groups.
For example, you might like to know how much the average “agreeableness” of countries changed from year to year. To examine this, you can use group_by() to perform your summary not for the entire dataset, but within each year.
Steps
- Add group_by() to your code to summarize() within each year.

In the last exercise, you performed a summary of the votes within each year. You could instead summarize() within each country, which would let you compare voting patterns between countries.
Steps
- summarize() within each country rather than within each year. Save the result as by_country.

Theory. Coming soon …
1. Sorting and filtering summarized data
In your last exercise,
2. by_country dataset
you created a dataset called by_country, containing one row for each country with the total number of votes and the percentage of votes that were yes. Now you might be interested in knowing which country voted yes the most or least often.
3. dplyr verb: arrange()
To discover this we’ll introduce one more dplyr verb: arrange. Arrange sorts a dataset based on one of its variables, in either ascending or descending order. This is useful for pulling a few interesting conclusions out of your data.
4. arrange()
Here, we could pipe by_country into the arrange operation, telling it to sort by the percent_yes column. We’d see that Zanzibar is the country that voted yes the least often in our dataset, followed by the United States. But we might also notice that Zanzibar only had two votes in our entire dataset, which means that 0% is basically meaningless! This is a very common way that summarized data can trip you up, and why you have to be careful about interpreting your results too quickly. To fix this, in your exercises you’ll filter the dataset to remove countries with a low total, just like you earlier used filter to remove vote rows we didn’t care about.
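A sketch of sorting in both directions, assuming by_country exists:

```r
library(dplyr)

# Countries that voted "yes" least often come first
by_country %>%
  arrange(percent_yes)

# desc() reverses the sort: most agreeable countries first
by_country %>%
  arrange(desc(percent_yes))
```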
5. Transforming tidy data
Notice that filter isn’t just useful for cleaning your raw data, but also for manipulating your summarized data. It’s therefore important to get comfortable using each of these dplyr verbs at all the stages of an analysis.
6. Let’s practice!
Now that you’ve summarized the dataset by country, you can start examining it and answering interesting questions.
For example, you might be especially interested in the countries that voted “yes” least often, or the ones that voted “yes” most often.
Steps
- Print the by_country dataset created in the last step.
- Use arrange() to sort the countries in ascending order of percent_yes.

In the last exercise, you may have noticed that the country that voted least frequently, Zanzibar, had only 2 votes in the entire dataset. You certainly can’t make any substantial conclusions based on that data!
Typically in a progressive analysis, when you find that a few of your observations have very little data while others have plenty, you set some threshold to filter them out.
Steps
- Use filter() to remove from the sorted data countries that have fewer than 100 votes.

Once you’ve cleaned and summarized data, you’ll want to visualize it to understand trends and extract insights. Here you’ll use the ggplot2 package to explore trends in United Nations voting within each country over time.
Theory. Coming soon …
1. Visualization with ggplot2
In the last chapter,
2. By-year data
you created a dataset showing the percentage of yes votes in each year. While this isn’t a “large” dataset by typical standards, it’s still difficult to read through it and get a sense of the trend over time, or to communicate that trend to others. Instead, you want to visualize the data as a line plot like this one.
3. Visualizing by-year data
4. Visualizing by-year data
We’ll use the ggplot2 package, which uses the ggplot function to construct a graph. A call to ggplot has three parts. First is the data frame, which we’ve already constructed as by_year. Second is the mapping of variables in that data frame, such as year and percent_yes, to the visual dimensions of the plot like the x and y axes, which we call “aesthetics”. This is done in an aes() call, where we choose to put year on the x-axis and percent_yes on the y-axis. The third part of a ggplot call is to add layers onto the plot. Here we add geom_line, where the geom_ prefix means we’re choosing which geometric objects to add to the plot. In your exercises you’ll try changing the layer you add, such as creating a scatter plot with points rather than a line plot.
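The three parts together, as a sketch assuming a by_year data frame with year and percent_yes columns:

```r
library(ggplot2)

ggplot(by_year, aes(x = year, y = percent_yes)) +  # data and aesthetics
  geom_line()                                      # layer: a line plot
```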
You’re going to create a line graph to show the trend over time of how many votes are “yes”.
2.3 Question
Which of the following aesthetics should you map the
yearvariable to?
⬜ Color
⬜ Width
✅ X-axis
⬜ Y-axis
Right! To plot a line graph to show the trend over time, the year variable should be on the x-axis.
In the last section, you learned how to summarize() the votes dataset by year, particularly the percentage of votes in each year that were “yes”.
You’ll now use the ggplot2 package to turn your results into a visualization of the percentage of “yes” votes over time.
Steps
The by_year dataset has the number of votes and percentage of “yes” votes each year.
- Load the ggplot2 package.
- Use ggplot() with the geom_line layer to create a line plot with year on the x-axis and percent_yes on the y-axis.

A line plot is one way to display this data. You could also choose to display it as a scatter plot, with each year represented as a single point. This requires changing the layer (i.e. geom_line() to geom_point()).
You can also add additional layers to your graph, such as a smoothing curve with geom_smooth().
Steps
Theory. Coming soon …
1. Visualizing by country
You’ve been able to plot the trend of percent_yes over time, but only for the United Nations as a whole. Mixing all countries into one trend doesn’t tell us much about international relations.
2. Examining by country and year
What if we wanted to plot the trend only for one country, such as the United States, to find out how its relationship with the United Nations has changed over time? First you’ll have to change your summary operation to structure the data appropriately.
3. Summarizing by country and year
You’ve summarized by year before, and by country. Now, you’re going to summarize by both, by adding year and country to the group_by operation. This gets a data frame with one row for each unique combination of year and country- for example, just for Afghanistan in 1947.
4. Filtering for one country
Once you have this data, you can extract the votes for just one country, such as the United States, with a filter operation. This data is then easy to visualize the same way you visualized overall trends in the last exercises. This by_year_country data gives us even more options, though: instead of plotting one country at a time, we can plot multiple.
5. The %in% operator
Let’s introduce the %in% operator, written as percent in percent. This lets us take one vector and determine which of its items are in another vector. For example, here it would determine that the second and fifth elements, B and E, are in the second vector.
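The example from the slide:

```r
x <- c("A", "B", "C", "D", "E")
x %in% c("B", "E")
#> [1] FALSE  TRUE FALSE FALSE  TRUE
```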
6. Filtering for multiple countries
The %in% operator thus lets us filter for multiple countries. Here you are filtering only for the United States and France, which end up as the only rows in our data frame. Don’t forget the c() in the designation of United States and France: that’s just the R way of defining a vector.
7. Visualizing vote trends by country
Once you’ve created the dataset you’ll want to visualize it with ggplot2. To show both countries on the same plot and distinguish them, you’ll need to add another aesthetic besides x and y to your aes call. In this case a good choice is color. By adding color = country to your aesthetics, you can plot both lines on one graph, with a legend distinguishing the two. This makes it easy to compare and contrast the two trends. You could use this flexible approach of filtering and graphing to compare any number of countries.
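A sketch of the filter-then-plot pattern, assuming by_year_country exists:

```r
library(dplyr)
library(ggplot2)

us_fr <- by_year_country %>%
  filter(country %in% c("United States", "France"))

# color = country draws one line per country, with a legend
ggplot(us_fr, aes(x = year, y = percent_yes, color = country)) +
  geom_line()
```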
You’re more interested in trends of voting within specific countries than you are in the overall trend. So instead of summarizing just by year, summarize by both year and country, constructing a dataset that shows what fraction of the time each country votes “yes” in each year.
Steps
- Summarize the votes by both year and country, saving the result as by_year_country.

Now that you have the percentage of time that each country voted “yes” within each year, you can plot the trend for a particular country. In this case, you’ll look at the trend for just the United Kingdom.
This will involve using filter() on your data before giving it to ggplot2.
Steps
- Filter the by_year_country dataset for the United Kingdom, saving the result as UK_by_year.

Plotting just one country at a time is interesting, but you really want to compare trends between countries. For example, suppose you want to compare voting trends for the United States, the UK, France, and India.
You’ll have to filter to include all four of these countries and use another aesthetic (not just the x- and y-axes) to distinguish the countries in the resulting visualization. Here, you’ll use the color aesthetic to represent different countries.
Steps
- Create a filtered version of by_year_country called filtered_4_countries with just the countries listed in the editor (you may find the %in% operator useful here).

# Vector of four countries to examine
countries <- c("United States", "United Kingdom",
"France", "India")
# 1. Filter by_year_country: filtered_4_countries
filtered_4_countries <- by_year_country %>%
filter(country %in% countries)
# 2. Line plot of % yes in four countries
ggplot(filtered_4_countries, aes(year, percent_yes, color = country)) +
geom_line()

Theory. Coming soon …
1. Faceting by country
In the last exercise you learned to plot multiple countries, distinguishing them by color. This is great for two or three countries,
2. Graphing many countries
but consider this graph where the trends of six countries are compared to each other. I don’t know about you, but I find this hard to interpret- the overlapping lines are difficult to distinguish, and I find myself forgetting which color represents which country.Instead, let’s introduce an alternative approach: faceting,
3. Graphing many countries
or creating “sub-plots” for each country. To facet, add an additional layer with + to the end of the plot: facet_wrap. Here you’ll use ~ country: in R the tilde means “explained by”, which says that we want to divide the graph into one subplot per country. When the six countries are divided onto separate subplots, it becomes a lot easier to understand each country’s trend. You might notice that all six graphs have the same y-axis, even though they cover different ranges. This leads to “wasted space” within each graph, where the trend in particular countries is compressed because of the patterns in other countries.
4. Graphing on separate scales
To avoid this, you can add a second argument, scales = "free_y". This lets the y-axis vary between each subplot, and use all the space
5. Graphing on separate scales
it has available
6. Graphing on separate scales
There are advantages and disadvantages to this approach: while there’s less wasted space within each subplot, it can also be misleading when comparing between them. But it’s an option worth being aware of. Faceting is a powerful tool, and in the exercises you’ll see that it is capable of plotting and comparing even a large number of countries.
7. Let’s practice!
Now you’ll take a look at six countries. While in the previous exercise you used color to represent distinct countries, this gets a little too crowded with six.
Instead, you will facet, giving each country its own sub-plot. To do so, you add a facet_wrap() layer after all of your other layers.
Steps
- Create a filtered version of by_year_country called filtered_6_countries, then facet your line plot by country.

# Vector of six countries to examine
countries <- c("United States", "United Kingdom",
"France", "Japan", "Brazil", "India")
# Filtered by_year_country: filtered_6_countries
filtered_6_countries <- by_year_country %>%
filter(country %in% countries)
# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(year, percent_yes)) +
geom_line() +
facet_wrap(~ country)

In the previous plot, all six graphs had the same axis limits. This made the changes over time hard to examine for plots with relatively little change.
Instead, you may want to let the plot choose a different y-axis for each facet.
Steps
# Vector of six countries to examine
countries <- c("United States", "United Kingdom",
"France", "Japan", "Brazil", "India")
# Filtered by_year_country: filtered_6_countries
filtered_6_countries <- by_year_country %>%
filter(country %in% countries)
# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(year, percent_yes)) +
geom_line() +
facet_wrap(~ country, scales = "free_y")

The purpose of an exploratory data analysis is to ask questions and answer them with data. Now it’s your turn to ask the questions.
You’ll choose some countries whose history you are interested in and add them to the graph. If you want to look up the full list of countries, enter by_country$country in the console.
Steps
- Add three countries of your choice to the countries vector, and therefore to the faceted graph.

# Add three more countries to this list
countries <- c("United States", "United Kingdom",
"France", "Japan", "Brazil", "India", "Germany", "Austria", "Denmark")
# Filtered by_year_country: filtered_countries
filtered_countries <- by_year_country %>%
filter(country %in% countries)
# Line plot of % yes over time faceted by country
ggplot(filtered_countries, aes(year, percent_yes)) +
geom_line() +
facet_wrap(~ country, scales = "free_y")

While visualization helps you understand one country at a time, statistical modeling lets you quantify trends across many countries and interpret them together. Here you’ll learn to use the tidyr, purrr, and broom packages to fit linear models to each country, and understand and compare their outputs.
Theory. Coming soon …
1. Linear regression
In the last chapter,
2. Quantifying trends
you learned to visualize the trend of the “% yes” metric over time for individual countries, and saw that Afghanistan’s agreement has generally been going up while the United States’ has been going down. However, while it’s easy to recognize this trend visually, we haven’t yet quantified it. In this chapter, we’re going to learn to model this trend with a linear regression,
3. Linear regression
finding a “best fit” line for each country. For example, here we can see that
4. Linear regression
Afghanistan has a positive slope
5. Linear regression
and the US a negative slope.
6. Fitting model to Afghanistan
First, you can use filter to extract the per-year data for one country, in this case Afghanistan, into its own data frame.
7. Fitting model to Afghanistan
You can then use the lm function, short for “linear model”, to fit the line. We describe the model as “percent yes, tilde, year.”
8. Fitting model to Afghanistan
Percent yes is our dependent variable, on the y-axis. Next is the tilde- in R this means “explained by”. Then we have “year”,
9. Fitting model to Afghanistan
the independent variable, on the x-axis. This says we’re modeling “percent yes explained by year.”
10. Fitting model to Afghanistan
We can examine this model using the summary function, run on the model object we created with lm. There’s a lot of output, and if you have experience in R you may recognize some of it, but we’re going to focus on the coefficient table in the middle. Each row here represents a term that’s been estimated: a y-intercept and a slope. The term we’re most interested in is the year term, also known as the slope, showing how much the year affects percent_yes. First we have an estimated slope term. In R, e-3 describes scientific notation, meaning 10 to the negative three; this makes the slope 0.006. This describes a positive slope: a 0.6% increase in % yes each year. We may also care about the p-value, which tests for statistical significance. We won’t talk much about the details of p-values in this course, but low p-values, such as this one, generally mean we can rule out that the effect is due to chance. Quantifying the trend is important,
11. Visualization can surprise you, but it doesn’t scale well.
because in the words of Hadley Wickham, “Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you.” Now that you’ve visualized a few examples and know what you’re looking for, you can apply a model. In the course of this chapter we’ll learn to “scale” this analysis
12. Let’s practice!
to compare all countries in our dataset at once.
A linear regression is a model that lets us examine how one variable changes with respect to another by fitting a best fit line. It is done with the lm() function in R.
Here, you’ll fit a linear regression to just the percentage of “yes” votes from the United States.
Steps
- Print the US_by_year data to the console.

# Percentage of yes votes from the US by year: US_by_year
US_by_year <- by_year_country %>%
filter(country == "United States")
# 1. Print the US_by_year data
US_by_year

- Within US_by_year, use lm() to run a linear regression predicting percent_yes from year. Save this to a variable US_fit.
- Summarize the US_fit model using the summary() function.

# Perform a linear regression of percent_yes by year: US_fit
US_fit <- lm(percent_yes ~ year, data = US_by_year)
# Perform summary() on the US_fit object
summary(US_fit)
#>
#> Call:
#> lm(formula = percent_yes ~ year, data = US_by_year)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.222491 -0.080635 -0.008661 0.081948 0.194307
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 12.6641455 1.8379743 6.890 8.48e-08 ***
#> year -0.0062393 0.0009282 -6.722 1.37e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1062 on 32 degrees of freedom
#> Multiple R-squared: 0.5854, Adjusted R-squared: 0.5724
#> F-statistic: 45.18 on 1 and 32 DF, p-value: 1.367e-07
The US_fit object you created in the previous exercise is available in your workspace. Calling summary() on this gives you lots of useful information about the linear model.
3.4 Question
What is the estimated slope of this relationship? Said differently, what’s the estimated change each year of the probability of the US voting “yes”?
⬜ 12.664
✅ -0.006
⬜ 8.48e-08
⬜ 1.37e-07
Not all positive or negative slopes are necessarily real. A p-value is a way of assessing whether a trend could be due to chance. Generally, data scientists set a threshold by declaring that, for example, p-values below .05 are significant.
3.6 Question
In this linear model, what is the p-value of the relationship between
yearandpercent_yes?
⬜ 12.664
⬜ -0.006
⬜ 8.48e-08
✅ 1.37e-07
Theory. Coming soon …
1. Tidying models with broom
In our last section,
2. A model fit is a “messy” object
you learned to perform a linear regression and interpret the results, noticing in particular the estimate of the slope and the p-value in this coefficients table. However, while we were able to see these values in the printed output, we didn’t discuss how to extract them within R. This is particularly important when combining multiple models.
3. Models are difficult to combine
If we had a linear regression for Afghanistan, for the United States, and for Canada, we wouldn’t have an easy way to combine these models, compare them, or visualize them. It’s possible to get these values out using built-in functions, but if you’re familiar with R you may recognize that there are some pitfalls that can make it unexpectedly difficult. There’s a tool that makes it particularly easy: my own broom package.
4. broom turns a model into a data frame
The broom package offers a function, tidy, that turns a linear model into a data frame of coefficients. In this case, the tidied coefficients have one row for the intercept and one for the slope, the ones we are interested in. Importantly, since this is a data frame, it is easy to extract values from it, and we’ll be able to use all our standard dplyr tools on it. In particular, this makes it possible to combine multiple models.
5. Tidy models can be combined
If you have two linear models, one for Afghanistan and one for the US, you could tidy each of them, and since the tidied models are the same shape they can be combined with dplyr’s bind_rows function. In the following sections you’ll build one model for each country and combine all of them.
In the last section, you fit a linear model. Now, you’ll use the tidy() function in the broom package to turn that model into a tidy data frame.
Steps
The US_fit linear model is available in your workspace.
- Load the broom package.
- Use the tidy() function from broom on the model object to turn it into a tidy data frame. Don’t store the result; just print the output to the console.

One important advantage of changing models to tidied data frames is that they can be combined.
In an earlier section, you fit a linear model to the percentage of “yes” votes for each year in the United States. Now you’ll fit the same model for the United Kingdom and combine the results from both countries.
Steps
- Fit a linear model to the United Kingdom’s voting record, saving it as UK_fit.
- Tidy US_fit into a data frame called US_tidied and the UK model into UK_tidied.
- Use bind_rows() from dplyr to combine the two tidied models, printing the result to the console.

# 1. Fit model for the United Kingdom
UK_by_year <- by_year_country %>%
filter(country == "United Kingdom")
UK_fit <- lm(percent_yes ~ year, UK_by_year)
# 2. Create US_tidied and UK_tidied
US_tidied <- tidy(US_fit)
UK_tidied <- tidy(UK_fit)
# 3. Combine the two tidied models
bind_rows(US_tidied, UK_tidied)

Awesome! We can easily compare the two models now.
Theory. Coming soon …
1. Nesting for multiple models
In these next two sections, we’re going to discuss fitting many models: in particular,
2. One model for each country
fitting one model for each country. This will allow us to find the countries whose level of agreement with the rest of the United Nations is increasing or decreasing most dramatically. Fitting multiple models requires several steps.
3. Start with one row per country
First, we start with the by_year_country dataset, containing one row for each combination of year and country. We need to separate this data out by country so we can model them individually. But instead of just pulling out one, as we’ve done before, we’re going to split it into many small datasets, one for each country.
4. nest() turns it into one row per country
To do this, we use nest() from the tidyr package. Writing nest(-country) means to nest all the columns besides country, so we end up with a data frame with one row for each country. All the other columns (year, total, and percent_yes) have been nested into a column called data. This is a list column, which we haven’t seen before. It allows each item in the column to itself be a data frame (specifically a tibble, dplyr’s version of a data frame) containing the other columns. This means we now have a filtered version for Afghanistan, a filtered version for Argentina, and so on. In the next lesson this will allow us to fit a model to each.
5. unnest() does the opposite
Later we’ll want to take a nested list column and bring the rows from each individual data frame back into the “top level” of the data frame. This is done with the function unnest(). Pipe the table in, saying you want to unnest the data column, and it takes each of those sub-tables and puts their rows back into the main table, getting us back to the data we started from. You might be wondering why we nested the data frame only to reverse it right after. In the next lesson we’ll add a step between the nesting and unnesting, fitting a model to each sub-table and tidying it, which will make this process useful.
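A small sketch of the nest/unnest round trip, using a toy stand-in for by_year_country (toy names, not the course objects). Note that current tidyr prefers the named form nest(data = -country); the course's bare nest(-country) still works but produces a warning:

```r
library(tidyr)
library(dplyr)

# A toy stand-in for by_year_country: one row per country-year pair
by_year_toy <- tibble(
  country = c("A", "A", "B", "B"),
  year = c(2000, 2001, 2000, 2001),
  percent_yes = c(0.5, 0.6, 0.7, 0.8)
)

# nest() collapses everything except country into a list column `data`,
# leaving one row per country
nested <- by_year_toy %>%
  nest(data = -country)

# unnest() reverses it, restoring one row per country-year pair
unnested <- nested %>%
  unnest(data)
```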
Right now, the by_year_country data frame has one row per country-vote pair. So that you can model each country individually, you’re going to “nest” all columns besides country, which will result in a data frame with one row per country. The data for each individual country will then be stored in a list column called data.
Steps
- Load the tidyr package.
- Use the nest() function to nest all the columns in by_year_country except country.

This “nested” data has an interesting structure. The second column, data, is a list, a type of R object that hasn’t yet come up in this course and that allows complicated objects to be stored within each row. This is because each item of the data column is itself a data frame.
# A tibble: 200 × 2
country data
<chr> <list>
1 Afghanistan <tibble [34 × 3]>
2 Argentina <tibble [34 × 3]>
3 Australia <tibble [34 × 3]>
4 Belarus <tibble [34 × 3]>
5 Belgium <tibble [34 × 3]>
6 Bolivia, Plurinational State of <tibble [34 × 3]>
7 Brazil <tibble [34 × 3]>
8 Canada <tibble [34 × 3]>
9 Chile <tibble [34 × 3]>
10 Colombia                        <tibble [34 × 3]>

You can use nested$data to access this list column and double brackets to access a particular element. For example, nested$data[[1]] would give you the data frame with Afghanistan’s voting history (the percent_yes per year), since Afghanistan is the first row of the table.
Steps
- Print the item in the data column that contains the data for Brazil.

The opposite of the nest() operation is the unnest() operation. This takes each of the data frames in the list column and brings those rows back to the main data frame.
In this exercise, you are just undoing the nest() operation. In the next section, you’ll learn how to fit a model in between these nesting and unnesting steps that makes this process useful.
Steps
- Unnest the data list column, so that the table again has one row for each country-year pair, much like by_year_country.

Theory. Coming soon …
1. Fitting multiple models
In the last exercises you nested a data frame
2. nest() turns data into one row per country
to create many smaller data frames, one for each country. Recall, for example, that the first item in the data column was a table of Afghanistan’s per-year data. Now you want to fit a model on each of these one-country datasets: one linear model for Afghanistan’s data, one for Argentina’s, and so on. To fit a model for each item in a list column, you’ll use the purrr package, which offers tools for working with functions and lists. In particular, you’ll use the map() function.
3. map() applies an operation to each item in a list
map() lets you apply an operation to each item in a list. For example, if you had a list v with values 1, 2, and 3, you could use map() with the expression ~ . * 10. The tilde and dot combination is a way of defining an operation, where the dot represents each item in the list: first 1, then 2, then 3. The expression therefore means “multiply each item by 10”, turning 1, 2, 3 into 10, 20, 30. map() is useful any time you want to do something to each item of a list.
4. map() fits a model to each dataset
Here we want to fit a linear model to each sub-data frame, storing the results in a new column. We use mutate() to define the new column model, and use map() to apply a linear regression to each item of data. We describe the regression with a tilde followed by the same formula we’d use on a single data frame, with the dot as the data argument. This creates a new column of linear models, one for each sub-data frame; the first item, for example, contains the model fit just for Afghanistan. It’s nice that we’ve fit these models, but as they stand we can’t combine them, manipulate them, or visualize them. That’s why we return to the broom package,
5. tidy turns each model into a data frame
which takes each model and turns it into a tidy data frame of coefficients. We use map() one more time to create another list column, calling this one tidied. So now for each country, we have three columns: one with the original data, one with a linear model, and one with the tidied model. Tidied versions of statistical models are easy to combine, so
6. unnest() combines the tidied models
just like in the last lesson we can use unnest() to bring them all into the top level. Now we have a table of coefficients, where the first two rows represent the slope and intercept for Afghanistan, the next two rows for Argentina, the next two for Australia, and so on: all of the details of each model in one place. This was four steps: nest by country, map to fit a model to each dataset, map to tidy each model, and unnest to get a table of coefficients. It’s a complicated process, but it gave us information about how each country was changing over time, in a way that goes well beyond what our earlier group_by() and summarize() allowed.
Now that you’ve divided the data for each country into a separate dataset in the data column, you need to fit a linear model to each of these datasets.
The map() function from purrr works by applying a formula to each item in a list, where . represents the individual item. For example, you could add one to each of a list of numbers:
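For instance, a sketch of the kind of snippet referenced here:

```r
library(purrr)

numbers <- list(1, 2, 3)

# ~ . + 1 adds one to each item; the dot stands for the current item
map(numbers, ~ . + 1)
#> a list containing 2, 3, and 4
```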
This means that to fit a model to each dataset, you can do:
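A hedged sketch of that step on a toy nested table (toy names, not the course objects):

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy stand-in for by_year_country
by_year_toy <- tibble(
  country = rep(c("A", "B"), each = 3),
  year = rep(2000:2002, 2),
  percent_yes = c(0.5, 0.6, 0.7, 0.9, 0.8, 0.7)
)

# Nest by country, then fit one linear model per sub-data frame;
# the dot in the formula call stands for each country's own data
models <- by_year_toy %>%
  nest(data = -country) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)))
```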
where . represents each individual item from the data column in by_year_country. Recall that each item in the data column is a dataset that pertains to a specific country.
Steps
- Load the tidyr and purrr packages.
- Use the map() function within a mutate() to perform a linear regression on each dataset (i.e. each item in the data column in by_year_country), modeling percent_yes as a function of year. Save the results to the model column.

You’ve now performed a linear regression on each nested dataset and have a linear model stored in the list column model. But you can’t recombine the models until you’ve tidied each into a table of coefficients. To do that, you’ll need to use map() one more time and the tidy() function from the broom package.
Recall that you can simply give a function to map() (e.g. map(models, tidy)) in order to apply that function to each item of a list.
Steps
- Load the broom package.
- Use the map() function to apply the tidy() function to each linear model in the model column, creating a new column called tidied.

# Load the broom package
library(broom)
# Add another mutate that applies tidy() to each model
by_year_country %>%
nest(-country) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .))) %>%
mutate(tidied = map(model, tidy))
#> Warning: All elements of `...` must be named.
#> Did you want `data = -country`?
You now have a tidied version of each model stored in the tidied column. You want to combine all of those into a large data frame, similar to how you combined the US and UK tidied models earlier. Recall that the unnest() function from tidyr achieves this.
Steps
- Add an unnest() step to unnest the tidied models stored in the tidied column. Save the result as country_coefficients.
- Print the country_coefficients object to the console.

# Add one more step that unnests the tidied column
country_coefficients <- by_year_country %>%
nest(-country) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
tidied = map(model, tidy)) %>%
unnest(tidied)
#> Warning: All elements of `...` must be named.
#> Did you want `data = -country`?
Theory. Coming soon …
1. Working with many tidy models
In the last exercises
2. We have a model for each country
you created a combined dataset, called country_coefficients, of the details of each per-country model, with rows for the slope and intercept for each country. Since the data is tidy, you can manipulate these coefficients with dplyr operations just like you did the original voting data. For example, in this analysis we’re interested in how countries change over time (the slope), not where they started (the intercept). So
3. Filter for the year term (slope)
we can use dplyr’s filter() to get only the cases where term equals “year”, the ones describing how year affected percent_yes: that is, filter for term == "year". Not all of these slopes can be trusted, though; some may be due to random noise. We may want to keep only the models that were statistically significant. Recall that the p-value of a model is a common metric for whether a result is due to noise; we often require the p-value to be less than 0.05 to call a trend significant. Here we run into a common issue you may be familiar with: when we run many statistical tests and evaluate their p-values, we need to do a multiple hypothesis correction. This is a complicated problem that is outside the scope of this course, but the basic issue is that if you try many tests, some p-values will fall below 0.05 by chance, so we need to be more restrictive. R provides a useful built-in function for p-value correction, called p.adjust().
4. Filtered by adjusted p-value
By filtering for cases where the adjusted p-value is less than 0.05, we can feel safer in our conclusions, and get a set of country trends that we believe are real. Using dplyr operations to work with many model outputs is a powerful way to draw conclusions out of a large dataset. In your exercises you’ll also use arrange() to find the countries with the strongest upward and downward trends over time.
You currently have both the intercept and slope terms for each by-country model. You’re probably more interested in how each is changing over time, so you want to focus on the slope terms.
Steps
- Print the country_coefficients data frame to the console.
- Add a filter() step that extracts only the slope (not intercept) terms.

Not all slopes are significant, and you can use the p-value to guess which are and which are not.
However, when you have lots of p-values, like one for each country, you run into the problem of multiple hypothesis testing, where you have to set a stricter threshold. The p.adjust() function is a simple way to correct for this: calling p.adjust(p.value) on a vector of p-values returns a set of adjusted p-values that account for the number of tests.
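A small base-R sketch of how p.adjust() tightens raw p-values (made-up values, for illustration only):

```r
# Three raw p-values, each under .05 on its own
p_values <- c(0.04, 0.03, 0.005)

# By default p.adjust() uses the Holm correction, which scales each
# p-value by how many tests remain when it is considered
adjusted <- p.adjust(p_values)

adjusted
#> 0.060 0.060 0.015

# After correction, only the smallest survives a .05 threshold
sum(adjusted < 0.05)
#> 1
```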
Here you’ll add two steps to process the slope_terms dataset: use a mutate to create the new, adjusted p-value column, and filter to filter for those below a .05 threshold.
Steps
- Use the p.adjust() function to adjust the p.value column, saving the result into a new p.adjusted column. Then, filter for cases where p.adjusted is less than .05.

# Filter for only the slope terms
slope_terms <- country_coefficients %>%
filter(term == "year")
# Add p.adjusted column, then filter
slope_terms %>%
mutate(p.adjusted = p.adjust(p.value)) %>%
filter(p.adjusted < 0.05)

Great work! Notice that there are now only 61 countries with significant trends.
Now that you’ve filtered for countries where the trend is probably not due to chance, you may be interested in countries whose percentage of “yes” votes is changing most quickly over time. Thus, you want to find the countries with the highest and lowest slopes; that is, the estimate column.
Steps
- Using arrange() and desc(), sort the filtered countries to find the countries whose percentage “yes” is most quickly increasing over time.

# Filter by adjusted p-values
filtered_countries <- country_coefficients %>%
filter(term == "year") %>%
mutate(p.adjusted = p.adjust(p.value)) %>%
filter(p.adjusted < .05)
# Sort for the countries increasing most quickly
filtered_countries %>%
arrange(desc(estimate))

- Using arrange(), sort to find the countries whose percentage “yes” is most quickly decreasing.

# Sort for the countries decreasing most quickly
filtered_countries %>%
arrange(estimate)

In this chapter, you’ll learn to combine multiple related datasets, such as incorporating information about each resolution’s topic into your vote analysis. You’ll also learn how to turn untidy data into tidy data, and see how tidy data can guide your exploration of topics and countries over time.
Theory. Coming soon …
1. Joining datasets
So far in our course on United Nations exploratory data analysis,
2. Processed votes
you’ve been working with this votes_processed dataset, where each row, or observation, represents a pairing of a roll call vote and country. You’ve been treating these roll call votes as interchangeable, paying attention to only the year, country and vote, and summarizing them to draw conclusions. But these resolutions cover a vast range of political and historical issues. In this chapter, you’re going to bring in some context about each resolution, specifically topic information. You’ll do this with the descriptions dataset
3. Descriptions dataset
4. inner_join()
to examine how different countries voted on different topics. This is done with dplyr’s inner_join() function. You use the by argument to name the columns the two tables have in common, rcid and session, which are used to match rows together between the tables. You then have all the variables from the original votes_processed dataset included in the new table, including vote, year, and country. You also have all the variables from the descriptions dataset: date, unres, and the topic columns. inner_join() combined the information in these two tables so we can examine them together.
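A hedged sketch of the join on toy stand-ins for the two tables (toy names and values, not the course data):

```r
library(dplyr)

# Toy stand-in for votes_processed: one row per roll call vote and country
votes_toy <- tibble(
  rcid = c(1, 1, 2),
  session = c(1, 1, 1),
  country = c("A", "B", "A"),
  vote = c(1, 3, 1)
)

# Toy stand-in for descriptions: one row per roll call vote,
# with a topic flag (co = 1 if the resolution concerns colonialism)
desc_toy <- tibble(
  rcid = c(1, 2),
  session = c(1, 1),
  co = c(1, 0)
)

# Match rows on the columns the two tables share
joined <- votes_toy %>%
  inner_join(desc_toy, by = c("rcid", "session"))

# Each vote now carries its resolution's topic flags, so you can,
# for example, keep only colonialism-related votes:
joined %>%
  filter(co == 1)
```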
5. Let’s practice!
In your exercises, you’ll manipulate this combined dataset using other dplyr operations, such as filtering for all votes related to human rights issues.
In the first chapter, you created the votes_processed dataset, containing information about each country’s votes. You’ll now combine that with the new descriptions dataset, which includes topic information about each resolution, so that you can analyze votes within particular topics.
To do this, you’ll make use of the inner_join() function from dplyr.
Steps
- Print the votes_processed dataset.
- Print the descriptions dataset.
- Join the two datasets using inner_join(), using the rcid and session columns to match them. Save as votes_joined.

There are six columns in the descriptions dataset (and therefore in the new joined dataset) that describe the topic of a resolution:

me: Palestinian conflict
nu: Nuclear weapons and nuclear material
di: Arms control and disarmament
hr: Human rights
co: Colonialism
ec: Economic development
Each contains a 1 if the resolution is related to this topic and a 0 otherwise.
Steps
Filter the votes_joined dataset for votes relating to colonialism.
In an earlier exercise, you graphed the percentage of votes each year where the US voted “yes”. Now you’ll create that same graph, but only for votes related to colonialism.
Steps
- Filter the votes_joined dataset for only votes by the United States relating to colonialism, then summarize() the percentage of votes that are “yes” within each year. Name the resulting column percent_yes and save the entire data frame as US_co_by_year.
- Add a geom_line() layer to your ggplot() call to create a line graph of the percentage of “yes” votes on colonialism (percent_yes) cast by the US over time.

Theory. Coming soon …
1. Tidy data
Consider this
2. United Kingdom
graph of UN voting trends over time. Like other graphs you’ve made, it maps
3. United Kingdom
“year” to the x-axis,
4. United Kingdom
“percentage yes” to the y-axis,
5. United Kingdom
and “country” to color. This graph, however, is faceted across the six topics, using one sub-graph for each topic. For instance,
6. United Kingdom
one single point on this graph represents
7. United Kingdom
the votes of the United Kingdom on the topic of colonialism in 2001. This useful kind of analysis is possible only with a particular structure of data:
8. Tidy data: topic is a variable
one where each observation, or row, represents a single combination of
9. Tidy data: topic is a variable
country, year, and topic. This allows every observation
10. Tidy data: topic is a variable
in the data to map to one point in your plot. Notice that this data includes a variable called “topic”,
11. Tidy data: topic is a variable
which specifies for each observation whether it relates to colonialism, nuclear weapons, and so on. We call this arrangement “tidy”.
12. Topic is spread across six columns
In the votes_joined dataset you used in the previous exercises, you don’t have a single topic variable, but rather one column for each of the six topics containing a zero or a one. This means there’s no easy way to use dplyr to summarize by topic, or to visualize the results for six topics on the same graph. In order to do that, we need to bring topic into a single variable.
13. Use gather() to bring columns into two
This can be done with the gather function in the tidyr package. gather is a reshaping operation that takes any number of columns and collects them into two: key,
14. Use gather() to bring columns into two
with the original column names, and value,
15. Use gather() to bring columns into two
with the contents of those columns. Notice that this typically increases the number of rows in the data.
16. Use gather() to bring columns into two variables
You can apply the gather() function to the votes_joined data to collect topic into one variable. First, you specify that you want to gather the me through ec columns: those are the six topic columns in the joined dataset. You then specify the names of the key and value columns: use “topic” for the key, which then contains the column names, and “has_topic” for the value, which is either 0 or 1. This achieves your goal of constructing a “topic” variable with six possible values. Notice that there are now 6 rows for each vote, one for each topic. In this case, you don’t actually care about rows where has_topic is zero. For example, those rows are effectively saying that a roll call vote was not related to me, the Palestinian conflict.
17. Use gather() to bring columns into one variable
You should therefore add one more step where you filter for all the cases where has_topic is 1, so that the topic column describes only topics a vote is actually associated with. Note that votes with multiple topics may appear multiple times in the dataset. By constructing a country-vote-topic dataset, you’ve now made it possible to group and summarize the data by topic, or to compare all six in the same visualization.
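A minimal sketch of the gather-then-filter pattern, on a toy two-topic version of the joined data (toy names, not the course objects):

```r
library(tidyr)
library(dplyr)

# Toy joined data: one row per vote, topic flags spread across columns
votes_toy <- tibble(
  rcid = c(1, 2),
  vote = c(1, 3),
  me = c(0, 1),   # Palestinian conflict flag
  co = c(1, 0)    # colonialism flag
)

# gather() collects the topic columns into key/value pairs
# (2 rows become 4), then the filter keeps only real topic matches
votes_toy %>%
  gather(me, co, key = "topic", value = "has_topic") %>%
  filter(has_topic == 1)
```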
18. Let’s practice!
Many analyses will require this kind of manipulation and restructuring of your data using tidyr and other tools.
Insert plot (comes later)
4.7 Question
According to the tidy data framework, which of the following counts as an observation in this graph?
⬜ A country
⬜ A vote
⬜ A country-vote combination
⬜ A country-topic combination
✅ A country-vote-topic combination
In order to represent the joined vote-topic data in a tidy form so we can analyze and graph by topic, we need to transform the data so that each row has one combination of country-vote-topic. This will change the data from having six columns (me, nu, di, hr, co, ec) to having two columns (topic and has_topic).
Steps
- Gather the six topic columns in votes_joined into one column called topic (containing one of me, nu, etc.) and a column called has_topic (containing 0 or 1). Print the result without saving it.
- You don’t actually want to keep rows where has_topic is 0. Perform the pivot_longer() / gather() operation again, but this time also filter for only the rows where the topic in topic describes the vote. Save the result as votes_gathered.

There’s one more step of data cleaning to make this more interpretable. Right now, topics are represented by two-letter codes:
So that you can interpret the data more easily, recode the data to replace these codes with their full name. You can do that with dplyr’s recode() function, which replaces values with ones you specify:
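A sketch of recode() on a couple of the two-letter codes (a toy vector, not the full dataset):

```r
library(dplyr)

topics <- c("co", "me", "co")

# recode() maps old values to the replacements you specify
recode(topics,
       co = "Colonialism",
       me = "Palestinian conflict")
#> "Colonialism" "Palestinian conflict" "Colonialism"
```

Inside a mutate(), the same call rewrites a whole column in place.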
Steps
- Use the recode() function from dplyr in a mutate() to replace each two-letter code in the votes_gathered data frame with the corresponding full name. Save this as votes_tidied.

In previous exercises, you summarized the votes dataset by country, by year, and by country-year combination.
Now that you have topic as an additional variable, you can summarize the votes for each combination of country, year, and topic (e.g. for the United States in 2013 on the topic of nuclear weapons.)
Steps
- Print the votes_tidied dataset to the console.
- In a single summarize() call, compute both the total number of votes (total) and the percentage of “yes” votes (percent_yes) for each combination of country, year, and topic. Save this as by_country_year_topic. Make sure that you ungroup() after summarizing.
- Print the by_country_year_topic dataset to the console.

You can now visualize the trends in percentage “yes” over time for all six topics side-by-side. Here, you’ll visualize them just for the United States.
Steps
- Filter the by_country_year_topic dataset for just the United States and save the result as US_by_country_year_topic.

Theory. Coming soon …
1. Tidy modeling by topic and country
In Chapter 3, you used the broom package to fit a separate linear model for each country that measured the trend of percentage of yes votes over time. This let you find the countries whose rate of agreement was increasing or decreasing most quickly.
2. Detecting a trend by topic
With the new datasets you’ve built in this chapter, you can fit these trends within each country and within each topic. For example, you could fit trends for the United Kingdom’s voting behavior within each of these six topics, as seen here.
3. Tidy modeling by country
Recall that there were several steps to fitting a model for each country. You first nested all columns besides country into their own sub-datasets in a list column. You then used map() to fit a linear model to each of these sub-datasets, and then tidied each of them into a table of coefficients. Finally, you used unnest() to bring those coefficients back into the main data frame, resulting in a combined table of slopes and intercepts. Now that you have a topic column in your by_year_country_topic summary, there’s only one change you need to make to this workflow to fit a model within each country/topic combination.
4. Tidy modeling by country and topic
In the nest statement, simply nest all columns besides country and topic. The other steps are identical. What results is a table with the estimated coefficients for each specific topic for each country. For example, these rows
5. Tidy modeling by country and topic
where the term equals “year” show the estimated slopes on the topics of colonialism, economic development, human rights, and so on within Afghanistan. This dataset will let you explore which countries had the strongest trends within particular topics: for example, which country most changed its voting pattern on the topic of colonialism. This analysis demonstrates the flexibility of the nest, model, and unnest pattern in exploratory analysis. You could have chosen to slice your data in many other ways or used alternative data sources, and the tidyr, dplyr, and broom packages will always give you the tools to answer the questions you’re interested in.
6. Let’s practice!
In the last chapter, you constructed a linear model for each country by nesting the data in each country, fitting a model to each dataset, then tidying each model with broom and unnesting the coefficients. The code looked something like this:
Now, you’ll again be modeling change in “percentage” yes over time, but instead of fitting one model for each country, you’ll fit one for each combination of country and topic.
Steps
- Load the purrr, tidyr, and broom packages.
- Print the by_country_year_topic dataset to the console.

# Load purrr, tidyr, and broom
library(purrr)
library(tidyr)
library(broom)
# Print by_country_year_topic
by_country_year_topic

- Fit a linear model within each combination of country and topic, saving the result as country_topic_coefficients. You can use the provided code as a starting point.
- Print the country_topic_coefficients dataset to the console.

# Fit model on the by_country_year_topic dataset
country_topic_coefficients <- by_country_year_topic %>%
nest(-country, -topic) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
tidied = map(model, tidy)) %>%
unnest(tidied)
# Print country_topic_coefficients
country_topic_coefficients

Great work! You can ignore the warning messages in the console for now.
Now you have both the slope and intercept terms for each model. Just as you did in the last chapter with the tidied coefficients, you’ll need to filter for only the slope terms.
You’ll also have to extract only cases that are statistically significant, which means adjusting the p-value for the number of models, and then filtering to include only significant changes.
Steps
- Filter the country_topic_coefficients data to include only the slope term.
- Add a p.adjusted column containing adjusted p-values (using the p.adjust() function).
- Filter for adjusted p-values below .05, saving the result as country_topic_filtered.

4.16 Question
Which combination of country and topic has the steepest downward trend?
⬜ Afghanistan on colonialism
⬜ Malawi on the Palestinian conflict
⬜ Vanuatu on colonialism
✅ Vanuatu on the Palestinian conflict
In the last exercise, you found that over its history, Vanuatu (an island nation in the Pacific Ocean) sharply changed its pattern of voting on the topic of Palestinian conflict.
Let’s examine this country’s voting patterns more closely. Recall that the by_country_year_topic dataset contained one row for each combination of country, year, and topic. You can use that to create a plot of Vanuatu’s voting, faceted by topic.
Steps
- Filter the by_country_year_topic variable for only Vanuatu’s votes to create a vanuatu_by_country_year_topic object.
- Create a plot with year on the x-axis and percent_yes on the y-axis, and facet by topic.

1. Conclusion
I hope you’ve enjoyed this exploration of the United Nations dataset,
2. Insert title here…
where we cleaned, visualized, and modeled historical data to uncover interesting trends. Note that we barely scratched the surface of what can be discovered from this data.
3. Insert title here…
Beyond looking at the percentage of yes votes, you could analyze what countries tended to agree or disagree with each other. You could use machine learning to predict a country’s vote on a particular resolution. I encourage you to take this voting data and try your own analyses. The best way to improve your skills with these tools and to build good analysis habits is to answer questions that are interesting to you.